-
Notifications
You must be signed in to change notification settings - Fork 350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add File source with SQS notifications #5148
Conversation
93cce59
to
056f785
Compare
quickwit/quickwit-metastore/src/metastore/postgres/queries/shards/open.sql
Outdated
Show resolved
Hide resolved
quickwit/quickwit-indexing/src/source/queue_sources/visibility.rs
Outdated
Show resolved
Hide resolved
quickwit/quickwit-indexing/src/source/queue_sources/visibility.rs
Outdated
Show resolved
Hide resolved
quickwit/quickwit-indexing/src/source/queue_sources/processor.rs
Outdated
Show resolved
Hide resolved
quickwit/quickwit-indexing/src/source/queue_sources/local_state.rs
Outdated
Show resolved
Hide resolved
quickwit/quickwit-indexing/src/source/queue_sources/local_state.rs
Outdated
Show resolved
Hide resolved
quickwit/quickwit-indexing/src/source/queue_sources/processor.rs
Outdated
Show resolved
Hide resolved
quickwit/quickwit-indexing/src/source/queue_sources/processor.rs
Outdated
Show resolved
Hide resolved
quickwit/quickwit-indexing/src/source/queue_sources/processor.rs
Outdated
Show resolved
Hide resolved
we need a different handling of transient vs non-transient error. |
cbcbfda
to
0c93920
Compare
) -> anyhow::Result<BatchBuilder> { | ||
let mut batch_builder = BatchBuilder::new(source_type); | ||
while batch_builder.num_bytes < BATCH_NUM_BYTES_LIMIT { | ||
let mut buf = String::new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why allocate at every single loop iteration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copied from main
let mut doc_line = String::new(); |
I guess we could allocate one big memory region of size
BATCH_NUM_BYTES_LIMIT
, load data into it and slice it by row. I think it's a bit far from this PR's concern though. Do you think these allocations are costly enough for opening an issue?
Description
This PR proposes the generic implementation of a "queue" source. For now, only an implementation for AWS SQS with its data backed by AWS S3 is exposed to the users. Google Pubsub as the queue implementation or inlined data (i.e messages containing the data itself and not the link to the object store) will come next.
We use the shard API to provide deduplication of messages. For the current implementation where the source data is stored on S3, the deduplication is made on the object URI.
High level summary of the abstractions that are part of the generic implementation:
Processor
exposes the exact same methods as theSource
trait but does not implement it directly. Instead, the concrete queue sources (e.g.SqsSource
) wrap theProcessor
.RawMessage
: the message as received from the QueuePreProcessedPayload
: the message went through the minimal transformation to discover its partition idCheckpointedMessage
: the message was checked against the shared state (shard API), it is now ready to be processedInProgressMessage
: the message that is actively being readQueueSharedState
is an abstraction over shard API. By callingopen_shard
upon reception of the messages we avoid costly redundant processing when receiving a duplicate message.QueueLocalState
represents the state machine of the messages as they are processed by the indexing pipelineVisibilityTaskHandle
a task that extends the message visibility when required (needs to be reworked)TODO:
Processor
abstractionQueue
traitopen_shard
API to acceptpublish_token
as a field. This gives upsert semantics to the API which makes it possible to acquire the shard upon creation.SourceConfig.use_shard_api()
)Processor
abstractionShardState
abstractionSqsSource
(with some small refactoring to reuse thesetup_index
helper from theKafkaSource
)Publisher
actorTODO in subsequent PRs:
How was this PR tested?
This PR contains unit tests and higher level tests that use LocalStack.